Script Identification from Bilingual Gujarati-English Documents

Authors

Shailesh A. Chaudhari

Ravi M. Gulati

Abstract

In a multilingual country like India, English words are often interspersed with Indian regional languages in official papers, school textbooks, and magazines. A bilingual Optical Character Recognition (OCR) system is therefore needed that can recognize such bilingual documents and store them for future use. In this paper, the authors present an OCR system developed for identifying Gujarati and Roman scripts in printed documents. The authors propose line-wise script identification. The spatial spread of pixels in the upper and lower parts of Gujarati and English text is used to identify the script. Horizontal projection is used to separate lines belonging to different scripts. A K-nearest neighbour algorithm then classifies 2000 text lines into two scripts, English and Gujarati, based on 4 spatial spread features extracted using connected components and horizontal projection. The proposed algorithm achieves an average classification accuracy as high as 99.70% for bi-script separation.
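The pipeline outlined in the abstract (horizontal projection to find text lines, spatial-spread features per line, then K-nearest-neighbour voting) might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the four features, so `zone_features` uses two hypothetical stand-ins (share of ink in the upper and lower thirds of a line), and all function names and sample values are invented for the example.

```python
from collections import Counter
import math


def horizontal_projection(image):
    """Count ink pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in image]


def segment_lines(profile):
    """Return (start, end) row ranges, end exclusive, where the profile is non-zero."""
    lines, start = [], None
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y                      # a text line begins
        elif count == 0 and start is not None:
            lines.append((start, y))       # a blank row closes the line
            start = None
    if start is not None:
        lines.append((start, len(profile)))
    return lines


def zone_features(image, start, end):
    """Hypothetical features (not from the paper): ink share in upper and lower thirds."""
    profile = horizontal_projection(image[start:end])
    total = sum(profile) or 1              # avoid division by zero
    third = max(1, (end - start) // 3)
    upper = sum(profile[:third]) / total
    lower = sum(profile[-third:]) / total
    return [upper, lower]


def knn_classify(train, query, k=3):
    """Majority vote among the k training samples nearest to query (Euclidean distance)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Toy binary page: two text lines separated by a blank row.
page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
]
line_ranges = segment_lines(horizontal_projection(page))

# Made-up training vectors, purely to exercise the classifier.
training = [
    ([0.10, 0.40], "English"),
    ([0.15, 0.35], "English"),
    ([0.50, 0.10], "Gujarati"),
    ([0.45, 0.15], "Gujarati"),
]
predicted = knn_classify(training, [0.48, 0.12], k=3)
```

In practice the feature vectors would come from labelled scanned lines, and the choice of `k` and of the features would need to be tuned on real data.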


Similar resources

Character Level Separation and Identification of English and Gujarati Digits from Bilingual (English-Gujarati) Printed Documents

Nowadays, English script is often interspersed within Indian languages. So there is a need for an optical character recognition (OCR) system which can recognize these bilingual documents and store them for future use. Hence, in this paper an OCR system is proposed that can read documents containing Gujarati and English scripts (digits only). These scripts have many features in ...


Identification of Printed Punjabi Words and English Numerals Using Gabor Features

Script identification is one of the challenging steps in the development of optical character recognition system for bilingual or multilingual documents. In this paper an attempt is made for identification of English numerals at word level from Punjabi documents by using Gabor features. The support vector machine (SVM) classifier with five fold cross validation is used to classify the word imag...


Wavelet Packet Based Texture Features for Automatic Script Identification

In a multi script environment, an archive of documents printed in different scripts is in practice. For automatic processing of such documents through Optical Character Recognition (OCR), it is necessary to identify the script type of the document. In this paper, a novel texture-based approach is presented to identify the script type of the collection of documents printed in ten Indian scripts ...


A Comparative Analysis of Classifiers Accuracies for Bilingual Printed Documents (Oriya-English)

Bilingual document recognition has been the subject of intensive research, and our focus is on the recognition of Oriya-English bilingual documents. In most of our official papers and school textbooks, it is observed that English words are interspersed within the Indian languages. So there is a need for an Optical Character Recognition (OCR) system which can recognize these bilingual documents and st...


Morphological Reconstruction for Word Level Script Identification

A line of a bilingual document page may contain text words in regional language and numerals in English. For Optical Character Recognition (OCR) of such a document page, it is necessary to identify different script forms before running an individual OCR system. In this paper, we have identified a tool of morphological opening by reconstruction of an image in different directions and regional de...




Publication date: 2014
